Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

Authors

Grégory M. Essertel

Ruby Y. Tahboub

James M. Decker

Kevin J. Brown

Kunle Olukotun

Tiark Rompf

Abstract

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an increase in expressiveness and performance. But how good are these extensions at extracting high performance from modern hardware platforms? While Spark has made impressive progress, we show that for relational workloads, there is still a significant gap compared with best-of-breed query engines. And when stepping outside of the relational world, query optimization techniques are ineffective if large parts of a computation have to be treated as user-defined functions (UDFs). We present Flare: a new back-end for Spark that brings performance closer to the best SQL engines, without giving up the added expressiveness of Spark. We demonstrate order-of-magnitude speedups both for relational workloads such as TPC-H and for a range of machine learning kernels that combine relational and iterative functional processing. Flare achieves these results through (1) compilation to native code, (2) replacing parts of the Spark runtime system, and (3) extending the scope of optimization and code generation to large classes of UDFs.
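The contrast the abstract draws between opaque UDFs and code generation can be made concrete with a toy sketch. The Python below is not Flare's or Spark's implementation, and every name in it (`run_interpreted`, `compile_query`, the sample `rows`) is hypothetical. It generates Python source rather than native code, and it only illustrates the general idea: when the engine can see an expression as code, it can inline it into one specialized loop, whereas an opaque callable can only be invoked row by row.

```python
# Toy illustration only: all names are hypothetical, not Flare's or Spark's API.

rows = [
    {"price": 10.0, "qty": 3},
    {"price": 4.0, "qty": 12},
    {"price": 7.5, "qty": 8},
]


def run_interpreted(rows, predicate, projection):
    """Generic plan: predicate and projection are opaque callables (like UDFs)."""
    total = 0.0
    for row in rows:
        if predicate(row):          # the engine cannot look inside this call
            total += projection(row)
    return total


def compile_query(predicate_src, projection_src):
    """Generate one fused loop with both expressions inlined, then compile it."""
    source = (
        "def query(rows):\n"
        "    total = 0.0\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["query"]


interpreted = run_interpreted(
    rows,
    lambda r: r["qty"] > 5,
    lambda r: r["price"] * r["qty"],
)
generated = compile_query('row["qty"] > 5', 'row["price"] * row["qty"]')(rows)
```

Both variables hold the same sum; the difference is only in how the loop was produced. A real system in this style would emit and compile native code instead of Python source.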

Related resources

A Reference Architecture and Road map for Enabling E-commerce on Apache Spark

Apache Spark is an execution engine that besides working as an isolated distributed, in-memory computing engine also offers close integration with Hadoop’s distributed file system (HDFS). Apache Spark's underlying appeal is in providing a unified framework to create sophisticated applications involving workloads. It unifies multiple workloads, handles unstructured data very well and has easy-to...

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We ...

SnappyData: A Unified Cluster for Streaming, Transactions and Interactive Analytics

Many modern applications are a mixture of streaming, transactional and analytical workloads. However, traditional data platforms are each designed for supporting a specific type of workload. The lack of a single platform to support all these workloads has forced users to combine disparate products in custom ways. The common practice of stitching heterogeneous environments has caused enormous pr...

Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

We present the design and a first performance evaluation of Thrill – a prototype of a general purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink with at least two main differences. First, Thrill is based on C++ which enables performance advantages due to direct native code compilation, a more...

Characterizing the Performance of Analytics Workloads on the Cray XC40

This paper describes an investigation of the performance characteristics of high performance data analytics (HPDA) workloads on the Cray XC40™, with a focus on commonly-used open source analytics frameworks like Apache Spark. We look at two types of Spark workloads: the Spark benchmarks from the Intel HiBench 4.0 suite and a CX matrix decomposition algorithm. We study performance from both the...

Journal:

CoRR

Volume abs/1703.08219

Publication date: 2017
